Continuing from project 1, I conduct further analysis on my dog breed datasets. Here, I work with the tidy version of the joined dataset created in project 1, which combines the official AKC breed dataset with a compiled Kaggle dataset. There are 104 distinct dog breeds under the `Breed` variable ("Giant Schnauzer" appears twice because it has two secondary classifications), along with 14 other variables including (but not limited to) `height`, `weight`, `Temperament`, `AvgPupPrice`, and first/second `Classification`. Among the non-temperament variables, all have 105 observations except `IntelligenceRating` (87) and `Watchdog` (104). The `Temperament` variable, separated into distinct attributes, has at least 15 observations per attribute, with a maximum of 35 (e.g. "Loyal").
library(tidyverse)
dog_data <- read_csv("combined_dog_data.csv")
dog_data_wider <- dog_data %>% pivot_wider(names_from = "Temperament",
values_from = "Indicator") %>% select(-MeanHeight)
# counts for temperament attributes (i.e. positive value
# indicator)
dog_data_wider %>% select(where(is.numeric)) %>% summarize_all(~sum(.x, na.rm = TRUE)) %>%
    select(Alert, Friendly, Affectionate, Loyal, Gentle, Independent,
        Protective, Playful)
## # A tibble: 1 x 8
## Alert Friendly Affectionate Loyal Gentle Independent Protective Playful
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 32 32 34 35 25 19 15 27
# overall counts for all observations
dog_data_wider %>% summarise_all(function(x) sum(!is.na(x)))
## # A tibble: 1 x 20
## Breed AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <int> <int> <int> <int> <int>
## 1 105 105 105 105 105
## # … with 15 more variables: LowWeightLbs <int>, HighWeightLbs <int>,
## # AvgPupPrice <int>, IntelligenceRating <int>, Watchdog <int>,
## # FirstClassification <int>, SecondClassification <int>, Alert <int>,
## # Friendly <int>, Affectionate <int>, Loyal <int>, Gentle <int>,
## # Independent <int>, Protective <int>, Playful <int>
library(cluster)
# Average Height vs. Average Weight
HW_dog_data <- dog_data_wider %>% select(AvgHeightInches, AvgWeightLbs)
sil_width <- vector()
for (i in 2:10) {
kms <- kmeans(HW_dog_data, centers = i)
sil <- silhouette(kms$cluster, dist(HW_dog_data))
sil_width[i] <- mean(sil[, 3])
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
# Average Height vs. Average Puppy Price
HP_dog_data <- dog_data_wider %>% select(AvgHeightInches, AvgPupPrice)
sil_width <- vector()
for (i in 2:10) {
kms <- kmeans(HP_dog_data, centers = i)
sil <- silhouette(kms$cluster, dist(HP_dog_data))
sil_width[i] <- mean(sil[, 3])
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
# Average Puppy Price vs. Watchdog Rating
PW_dog_data <- dog_data_wider %>% select(AvgPupPrice, Watchdog) %>%
na.omit()
sil_width <- vector()
for (i in 2:10) {
kms <- kmeans(PW_dog_data, centers = i)
sil <- silhouette(kms$cluster, dist(PW_dog_data))
sil_width[i] <- mean(sil[, 3])
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
The k-means clustering results vary greatly depending on which numerical variables are chosen. For some comparisons, 3 clusters yield the highest silhouette width, while for others 8-9 clusters do. The purpose of this section is to show the variability in the appropriate number of clusters. For the most part, there is no distinct number of groupings that would yield a strong or even reasonable silhouette width. That said, knowing there are 7 group classifications for dog breeds, the PAM clustering will proceed with 7 clusters.
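The read-it-off-the-plot step above can also be done programmatically. A minimal helper (a sketch reusing the silhouette-width loop above; `kmeans` is stochastic, so a seed is set for reproducibility):

```r
library(cluster)

# Returns the k (from k_range) with the highest average silhouette width.
best_k <- function(df, k_range = 2:10) {
    d <- dist(df)
    widths <- sapply(k_range, function(k) {
        km <- kmeans(df, centers = k)
        mean(silhouette(km$cluster, d)[, 3])
    })
    k_range[which.max(widths)]
}

set.seed(322)
best_k(HW_dog_data)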
numeric_dog_data <- dog_data_wider %>% select(AvgHeightInches,
AvgWeightLbs, AvgPupPrice)
set.seed(322)
pam2 <- numeric_dog_data %>% scale %>% pam(7)
pam2
## Medoids:
## ID AvgHeightInches AvgWeightLbs AvgPupPrice
## [1,] 13 1.1030760 2.7169398 1.061092041
## [2,] 15 -0.5955273 -0.7146952 0.965457136
## [3,] 16 0.4854021 0.3834280 0.774187326
## [4,] 37 1.1802852 0.9324896 0.009108086
## [5,] 70 -1.3676196 -1.0166791 -0.469066439
## [6,] 62 -0.2094811 -0.2342663 -0.373431534
## [7,] 85 0.7170298 0.2461626 -0.755971153
## Clustering vector:
## [1] 1 2 3 2 3 1 1 3 2 3 2 2 1 3 2 3 3 2 2 3 2 2 4 4 1 4 3 3 2 3 2 2 4 4 5 4 4
## [38] 4 4 6 6 4 4 1 6 4 6 7 5 5 6 5 6 6 4 7 4 7 7 6 6 6 6 6 5 5 5 5 5 5 5 4 7 7
## [75] 7 5 5 6 5 7 6 6 6 6 7 7 5 5 5 5 7 5 6 6 5 6 7 5 5 7
## [ reached getOption("max.print") -- omitted 5 entries ]
## Objective function:
## build swap
## 0.6417295 0.6199676
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
# Visualization of Pairwise Combinations
library(GGally)
numeric_dog_data <- numeric_dog_data %>% mutate(cluster = as.factor(pam2$clustering))
numeric_dog_data %>% ggpairs(cols = 1:3, aes(color = cluster))
# goodness of fit; silhouette width
plot(pam2, which = 2)
Through PAM clustering on average height, weight, and puppy price, the average silhouette width was 0.34 (weak structure). Using ggpairs, multiple visualization methods compare the groups. Across average height, there are distinct peaks separating the cluster groups. Similarly, but less clearly, average weight also shows distinction between the groups. Comparing average height and average weight, there is an overall correlation of 0.855, indicating a strong positive correlation between height and weight. On the other hand, the distinction between groups based on puppy price is not clear: for the most part, the price distributions are left skewed and similar to one another, and there is only a weak correlation between puppy price and height or weight. Overall, following the 7 distinct groupings suggested by the number of FirstClassification categories, the goodness of fit is weak and the clusters appear artificial. This makes sense, as these groupings are imposed on dog breeds based on appearance and human-given tasks (i.e. by breeders and AKC officials) rather than natural/biological traits.
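The silhouette-width interpretations used throughout (e.g. 0.34 as weak structure) follow the conventional cutoffs, which can be encoded directly rather than judged by eye. A small sketch:

```r
# Conventional silhouette-width cutoffs: <= 0.25 little/no structure,
# 0.25-0.50 weak, 0.50-0.70 reasonable, > 0.70 strong.
interpret_sil <- function(w) {
    cut(w, breaks = c(-1, 0.25, 0.5, 0.7, 1),
        labels = c("little/no structure", "weak structure",
                   "reasonable structure", "strong structure"))
}
interpret_sil(pam2$silinfo$avg.width)
```

The average width itself is stored on the fitted object as `pam2$silinfo$avg.width`, so the value quoted above does not need to be read off the plot.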
numeric_dog_data <- dog_data_wider %>% select(AvgHeightInches,
AvgWeightLbs, AvgPupPrice, Watchdog)
set.seed(322)
pam2 <- numeric_dog_data %>% scale %>% pam(7)
pam2
## Medoids:
## ID AvgHeightInches AvgWeightLbs AvgPupPrice Watchdog
## [1,] 23 1.0258667 1.7560820 0.487282611 1.6386216
## [2,] 12 -0.6727365 -0.6872421 1.252361850 -0.4312162
## [3,] 34 1.1030760 0.5893261 0.296012801 NA
## [4,] 41 -0.4411088 -0.5499767 0.009108086 0.9486757
## [5,] 70 -1.3676196 -1.0166791 -0.469066439 -0.4312162
## [6,] 62 -0.2094811 -0.2342663 -0.373431534 -0.4312162
## [7,] 80 0.6398205 0.3834280 -0.755971153 -0.4312162
## Clustering vector:
## [1] 1 2 2 2 3 1 3 2 2 3 4 2 1 1 2 3 3 4 2 3 2 5 1 1 1 1 3 3 2 6 4 4 3 3 5 3 3
## [38] 3 3 4 4 3 3 1 6 1 6 7 5 5 6 5 6 3 3 3 3 7 7 6 6 6 6 6 5 5 4 5 5 5 5 1 7 3
## [75] 7 5 5 6 5 7 4 4 6 4 3 7 5 5 5 5 7 5 6 4 5 4 7 5 5 7
## [ reached getOption("max.print") -- omitted 5 entries ]
## Objective function:
## build swap
## 0.8749512 0.8702620
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
# Visualization of Pairwise Combinations
numeric_dog_data <- numeric_dog_data %>% mutate(cluster = as.factor(pam2$clustering))
numeric_dog_data %>% ggpairs(cols = 1:4, aes(color = cluster))
# goodness of fit; silhouette width
plot(pam2, which = 2)
Looking at a PAM clustering on average height, weight, puppy price, and watchdog rating, the average silhouette width was 0.21 (little/no structure). Using ggpairs, multiple visualization methods compare the groups. The analysis of the first three variables is the same as in the previous section. Focusing on watchdog rating, there is no distinct differentiation among the groups: of the 7 groups, 2 scored high on overall watchdog rating, while the other 5 overlap at lower scores. There is only a slight correlation between watchdog rating and puppy price, and a weak positive correlation between watchdog rating and average height/weight. It should also be noted that watchdog rating could be treated as a categorical (ordinal) variable rather than a numerical one, which would be better handled with Gower dissimilarities. Overall, following the 7 distinct groupings suggested by the number of FirstClassification categories, the clustering lacks structure and goodness of fit. Analysis of the chosen variables indicates that the 7 breed groupings do not arise from these distinct features but through artificial means.
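The Gower alternative mentioned above can be sketched by recasting `Watchdog` as an ordered factor, so `daisy()` uses rank-based dissimilarity for it instead of treating it as numeric (a sketch only; the resulting width may differ from the numeric run above):

```r
# Treat Watchdog as ordinal and cluster on Gower dissimilarities.
mixed_dog_data <- dog_data_wider %>%
    select(AvgHeightInches, AvgWeightLbs, AvgPupPrice, Watchdog) %>%
    na.omit() %>%
    mutate(Watchdog = factor(Watchdog, ordered = TRUE))
gower_d <- daisy(mixed_dog_data, metric = "gower")
pam_gower <- pam(gower_d, k = 7, diss = TRUE)
pam_gower$silinfo$avg.width
```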
numeric_dog_data <- dog_data_wider %>% select(AvgHeightInches,
AvgWeightLbs, AvgPupPrice, Watchdog, IntelligenceRating)
set.seed(322)
pam2 <- numeric_dog_data %>% scale %>% pam(7)
pam2
## Medoids:
## ID AvgHeightInches AvgWeightLbs AvgPupPrice Watchdog IntelligenceRating
## [1,] 26 0.9486575 1.61881663 0.4872826 0.9486757 NA
## [2,] 12 -0.6727365 -0.68724211 1.2523619 -0.4312162 0.31144310
## [3,] 34 1.1030760 0.58932612 0.2960128 NA NA
## [4,] 81 0.1765651 -0.02836818 -0.7559712 0.9486757 -0.05121244
## [5,] 65 -1.8308751 -1.15394447 -0.3734315 -0.4312162 NA
## [6,] 47 0.3309836 -0.16563358 -0.1821617 -0.4312162 NA
## [7,] 67 -0.7499457 -0.78332789 -0.3734315 0.2587297 NA
## Clustering vector:
## [1] 1 2 2 2 3 1 1 2 2 3 2 2 1 1 2 3 3 4 2 3 2 5 1 3 1 1 3 6 2 6 7 7 3 3 5 3 3
## [38] 3 3 7 7 1 3 1 6 1 6 6 5 7 6 5 6 3 1 4 3 3 3 6 7 6 7 7 5 5 7 7 7 5 5 1 6 4
## [75] 6 5 7 6 7 6 4 7 6 4 3 6 5 5 5 5 6 5 6 4 5 4 6 5 5 4
## [ reached getOption("max.print") -- omitted 5 entries ]
## Objective function:
## build swap
## 1.055327 1.055327
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
# Visualization of Pairwise Combinations
numeric_dog_data <- numeric_dog_data %>% mutate(cluster = as.factor(pam2$clustering))
numeric_dog_data %>% ggpairs(cols = 1:5, aes(color = cluster))
# goodness of fit; silhouette width
plot(pam2, which = 2)
Looking at a PAM clustering of all the numeric variables, the average silhouette width was 0.12 (little/no structure). Using ggpairs, multiple visualization methods compare the groups. The analysis of the first four variables is the same as in the previous section. Focusing on intelligence rating, there is little differentiation among the groups: for the most part, the groups share similar average intelligence ratings, with some groups more spread out and others more concentrated around the center. Other than with watchdog rating (a weak negative correlation), intelligence rating does not show any correlation with the other variables. Overall, following the 7 distinct groupings suggested by the number of FirstClassification categories, the clustering lacks structure and goodness of fit. At least for the classification of breeds, it is most likely that intelligence and watchdog rating were not considered when making the distinctions.
set.seed(322)
pamHW <- HW_dog_data %>% pam(k = 7)
pamclustHW <- HW_dog_data %>% mutate(cluster = as.factor(pamHW$clustering))
pamclustHW_all <- dog_data_wider %>% mutate(cluster = as.factor(pamHW$clustering))
# creates color groupings for clusters
pamclustHW %>% ggplot(aes(AvgHeightInches, AvgWeightLbs, color = cluster)) +
geom_point()
# average numeric values for each cluster group
pamclustHW %>% group_by(cluster) %>% summarize_if(is.numeric,
mean, na.rm = T)
## # A tibble: 7 x 3
## cluster AvgHeightInches AvgWeightLbs
## <fct> <dbl> <dbl>
## 1 1 28 157.
## 2 2 15.4 25.9
## 3 3 20.3 48.6
## 4 4 27.6 105.
## 5 5 25.9 77.8
## 6 6 24.1 64.2
## 7 7 10.5 10.8
pamclustHW_all %>% group_by(cluster) %>% summarize_if(is.numeric,
mean, na.rm = T)
## # A tibble: 7 x 18
## cluster AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 28 27 29 157.
## 2 2 15.4 14.3 16.6 25.9
## 3 3 20.3 18.7 21.9 48.6
## 4 4 27.6 25.5 29.6 105.
## 5 5 25.9 24.5 27.4 77.8
## 6 6 24.1 22.2 26 64.2
## 7 7 10.5 9.55 11.4 10.8
## # … with 13 more variables: LowWeightLbs <dbl>, HighWeightLbs <dbl>,
## # AvgPupPrice <dbl>, IntelligenceRating <dbl>, Watchdog <dbl>, Alert <dbl>,
## # Friendly <dbl>, Affectionate <dbl>, Loyal <dbl>, Gentle <dbl>,
## # Independent <dbl>, Protective <dbl>, Playful <dbl>
# final medoids that are most representative of their cluster
dog_data_wider %>% slice(pamHW$id.med)
## # A tibble: 7 x 20
## Breed AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Sain… 26.5 25 28 150
## 2 Tibe… 15.5 14 17 25
## 3 Chow… 20.5 19 22 50
## 4 Akita 27 26 28 100
## 5 Weim… 26 25 27 77.5
## 6 Ches… 23.5 21 26 65
## 7 Engl… 10 10 10 12
## # … with 15 more variables: LowWeightLbs <dbl>, HighWeightLbs <dbl>,
## # AvgPupPrice <dbl>, IntelligenceRating <dbl>, Watchdog <dbl>,
## # FirstClassification <chr>, SecondClassification <chr>, Alert <dbl>,
## # Friendly <dbl>, Affectionate <dbl>, Loyal <dbl>, Gentle <dbl>,
## # Independent <dbl>, Protective <dbl>, Playful <dbl>
# original plot of height vs weight colored by first
# classification groupings
dog_data_wider %>% ggplot(aes(AvgHeightInches, AvgWeightLbs,
color = FirstClassification)) + geom_point()
# combination plot of original coloring based on first
# classification and shaping based on PAM clustering
pamclustHW %>% mutate(Grouping = dog_data_wider$FirstClassification) %>%
ggplot(aes(AvgHeightInches, AvgWeightLbs, color = Grouping,
shape = cluster)) + geom_point(size = 4)
# goodness of fit; silhouette width
plot(pamHW, which = 2)
The PAM clustering on average height and weight had an average silhouette width of 0.54 (reasonable structure). The first plot shows the PAM clusters on a height vs. weight scatterplot of all the observations; the 7 groupings are distributed along increasing height and weight. The following 2 tables give the average values of the numeric variables for each group, showing the differences between the groups. The next table shows the medoids (i.e. the dog breeds most representative of their groupings) with their numeric variables and corresponding values. The second plot is the same scatterplot colored by the actual AKC groupings. The final plot combines the two, with color representing the actual groupings and shape representing the PAM clusters. For the most part, the PAM clustering was reasonably accurate when looking at average height and weight. It is likely that height and weight were contributing factors when classifying dog breeds into their first classification.
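The correspondence between the PAM clusters and the AKC classifications can also be checked numerically with a simple cross-tabulation; a strong correspondence would show up as one dominant classification per cluster row:

```r
# Rows: PAM clusters; columns: AKC first classifications.
table(cluster = pamHW$clustering,
      classification = dog_data_wider$FirstClassification)
```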
set.seed(322)
pamHP <- HP_dog_data %>% pam(k = 7)
pamclustHP <- HP_dog_data %>% mutate(cluster = as.factor(pamHP$clustering))
pamclustHP_all <- dog_data_wider %>% mutate(cluster = as.factor(pamHP$clustering))
# creates color groupings for clusters
pamclustHP %>% ggplot(aes(AvgHeightInches, AvgPupPrice, color = cluster)) +
geom_point()
# average numeric values for each cluster group
pamclustHP %>% group_by(cluster) %>% summarize_if(is.numeric,
mean, na.rm = T)
## # A tibble: 7 x 3
## cluster AvgHeightInches AvgPupPrice
## <fct> <dbl> <dbl>
## 1 1 18.4 2788.
## 2 2 22.8 1919.
## 3 3 20.4 1550
## 4 4 20.8 1321.
## 5 5 19.6 961.
## 6 6 18.0 712.
## 7 7 16.2 445
pamclustHP_all %>% group_by(cluster) %>% summarize_if(is.numeric,
mean, na.rm = T)
## # A tibble: 7 x 18
## cluster AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 18.4 17.5 19.2 67.8
## 2 2 22.8 21.1 24.4 61.4
## 3 3 20.4 18.9 21.9 64.8
## 4 4 20.8 19.5 22.1 65
## 5 5 19.6 18.1 21.1 49.5
## 6 6 18.0 16.5 19.5 40.3
## 7 7 16.2 15 17.4 33.8
## # … with 13 more variables: LowWeightLbs <dbl>, HighWeightLbs <dbl>,
## # AvgPupPrice <dbl>, IntelligenceRating <dbl>, Watchdog <dbl>, Alert <dbl>,
## # Friendly <dbl>, Affectionate <dbl>, Loyal <dbl>, Gentle <dbl>,
## # Independent <dbl>, Protective <dbl>, Playful <dbl>
# final medoids that are most representative of their cluster
dog_data_wider %>% slice(pamHP$id.med)
## # A tibble: 7 x 20
## Breed AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Port… 21.5 20 23 51
## 2 Phar… 23 21 25 50
## 3 Labr… 22.5 21 24 67.5
## 4 Beau… 25.5 24 27 110
## 5 Sibe… 21.5 20 23 50
## 6 Bord… 20 19 21 40
## 7 Poin… 22.5 21 24 55
## # … with 15 more variables: LowWeightLbs <dbl>, HighWeightLbs <dbl>,
## # AvgPupPrice <dbl>, IntelligenceRating <dbl>, Watchdog <dbl>,
## # FirstClassification <chr>, SecondClassification <chr>, Alert <dbl>,
## # Friendly <dbl>, Affectionate <dbl>, Loyal <dbl>, Gentle <dbl>,
## # Independent <dbl>, Protective <dbl>, Playful <dbl>
# original plot of height vs puppy price colored by first
# classification groupings
dog_data_wider %>% ggplot(aes(AvgHeightInches, AvgPupPrice, color = FirstClassification)) +
geom_point()
# combination plot of original coloring based on first
# classification and shaping based on PAM clustering
pamclustHP %>% mutate(Grouping = dog_data_wider$FirstClassification) %>%
ggplot(aes(AvgHeightInches, AvgPupPrice, color = Grouping,
shape = cluster)) + geom_point(size = 4)
# goodness of fit; silhouette width
plot(pamHP, which = 2)
The PAM clustering on average height and puppy price had an average silhouette width of 0.6 (reasonable structure). The first plot shows the PAM clusters on a height vs. puppy price scatterplot of all the observations; the 7 groupings are distributed along increasing puppy price. The following 2 tables give the average values of the numeric variables for each group, showing the differences between the groups. The next table shows the medoids (i.e. the dog breeds most representative of their groupings) with their numeric variables and corresponding values. The second plot is the same scatterplot colored by the actual AKC groupings, and the final plot combines the two, with color representing the actual groupings and shape representing the PAM clusters. From a visual inspection, these PAM clusters do not appear as accurate as those in the previous section; however, the average silhouette width is higher here, so this clustering has the more reasonable structure.
PW_dog_data <- dog_data_wider %>% select(AvgPupPrice, Watchdog)
set.seed(322)
pamPW <- PW_dog_data %>% pam(k = 7)
pamclustPW <- PW_dog_data %>% mutate(cluster = as.factor(pamPW$clustering))
pamclustPW_all <- dog_data_wider %>% mutate(cluster = as.factor(pamPW$clustering))
# creates color groupings for clusters
pamclustPW %>% ggplot(aes(AvgPupPrice, Watchdog, color = cluster)) +
geom_point()
# average numeric values for each cluster group
pamclustPW %>% group_by(cluster) %>% summarize_if(is.numeric,
mean, na.rm = T)
## # A tibble: 7 x 3
## cluster AvgPupPrice Watchdog
## <fct> <dbl> <dbl>
## 1 1 2680 3.4
## 2 2 1795 3.8
## 3 3 1365 4.37
## 4 4 1032. 3.21
## 5 5 880 4.05
## 6 6 700 3.24
## 7 7 445 2.9
pamclustPW_all %>% group_by(cluster) %>% summarize_if(is.numeric,
mean, na.rm = T)
## # A tibble: 7 x 18
## cluster AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 19.9 19 20.8 65.2
## 2 2 22.2 20.5 24 71.2
## 3 3 20.2 18.9 21.4 60.3
## 4 4 22.2 20.6 23.8 58.8
## 5 5 17.7 16.3 19.1 44.4
## 6 6 17.6 16.1 19.1 36.7
## 7 7 16.2 15 17.4 33.8
## # … with 13 more variables: LowWeightLbs <dbl>, HighWeightLbs <dbl>,
## # AvgPupPrice <dbl>, IntelligenceRating <dbl>, Watchdog <dbl>, Alert <dbl>,
## # Friendly <dbl>, Affectionate <dbl>, Loyal <dbl>, Gentle <dbl>,
## # Independent <dbl>, Protective <dbl>, Playful <dbl>
# final medoids that are most representative of their cluster
dog_data_wider %>% slice(pamPW$id.med)
## # A tibble: 7 x 20
## Breed AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Port… 21.5 20 23 51
## 2 Weim… 26 25 27 77.5
## 3 Bull… 21.5 21 22 60
## 4 Vizs… 22.4 18.9 26.0 23.5
## 5 Gord… 25 23 27 62.5
## 6 Pome… 12 12 12 5
## 7 Poin… 22.5 21 24 55
## # … with 15 more variables: LowWeightLbs <dbl>, HighWeightLbs <dbl>,
## # AvgPupPrice <dbl>, IntelligenceRating <dbl>, Watchdog <dbl>,
## # FirstClassification <chr>, SecondClassification <chr>, Alert <dbl>,
## # Friendly <dbl>, Affectionate <dbl>, Loyal <dbl>, Gentle <dbl>,
## # Independent <dbl>, Protective <dbl>, Playful <dbl>
# original plot of puppy price vs watchdog rating colored by
# first classification groupings
dog_data_wider %>% ggplot(aes(AvgPupPrice, Watchdog, color = FirstClassification)) +
geom_point()
# combination plot of original coloring based on first
# classification and shaping based on PAM clustering
pamclustPW %>% mutate(Grouping = dog_data_wider$FirstClassification) %>%
ggplot(aes(AvgPupPrice, Watchdog, color = Grouping, shape = cluster)) +
geom_point(size = 4)
# goodness of fit; silhouette width
plot(pamPW, which = 2)
The PAM clustering on average puppy price and watchdog rating had an average silhouette width of 0.61 (reasonable structure). The first plot shows the PAM clusters on a puppy price vs. watchdog rating scatterplot of all the observations; the 7 groupings are distributed along increasing puppy price. The following 2 tables give the average values of the numeric variables for each group, showing the differences between the groups. The next table shows the medoids (i.e. the dog breeds most representative of their groupings) with their numeric variables and corresponding values. The second plot is the same scatterplot colored by the actual AKC groupings, and the final plot combines the two, with color representing the actual groupings and shape representing the PAM clusters. As in the previous section, these PAM clusters do not visually appear as accurate as those in the “Average Height vs. Average Weight” section; however, the average silhouette width is the highest of the three pairings, so this clustering has the most reasonable structure.
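The three pairwise silhouette widths quoted in these sections can be collected side by side from the fitted pam objects:

```r
# Average silhouette width for each pairwise PAM clustering above.
c(HeightWeight  = pamHW$silinfo$avg.width,
  HeightPrice   = pamHP$silinfo$avg.width,
  PriceWatchdog = pamPW$silinfo$avg.width)
```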
dog_data_wider2 <- dog_data_wider %>% select(Breed, AvgHeightInches,
AvgWeightLbs, AvgPupPrice) %>% distinct()
# create dissimilarity matrix
dog_data_wider2 <- dog_data_wider2 %>% mutate_if(is.character,
as.factor) %>% column_to_rownames("Breed")
gower1 <- daisy(dog_data_wider2, metric = "gower")
# Choose Number of Clusters with PAM: Silhouette Width
sil_width <- vector()
for (i in 2:10) {
pam_fit <- pam(gower1, diss = TRUE, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
# PAM Clustering with Gower Dissimilarities
pam3 <- pam(gower1, k = 2, diss = T)
# final medoids that are most representative of their cluster
dog_data_wider %>% slice(pam3$id.med)
## # A tibble: 2 x 20
## Breed AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Borz… 27 26 28 85
## 2 Seal… 12 12 12 19
## # … with 15 more variables: LowWeightLbs <dbl>, HighWeightLbs <dbl>,
## # AvgPupPrice <dbl>, IntelligenceRating <dbl>, Watchdog <dbl>,
## # FirstClassification <chr>, SecondClassification <chr>, Alert <dbl>,
## # Friendly <dbl>, Affectionate <dbl>, Loyal <dbl>, Gentle <dbl>,
## # Independent <dbl>, Protective <dbl>, Playful <dbl>
# goodness of fit; silhouette width
plot(pam3, which = 2)
# Visualization of Pairwise Combinations
dog_data_wider2 <- dog_data_wider2 %>% mutate(cluster = as.factor(pam3$clustering))
dog_data_wider2 %>% ggpairs(cols = 1:4, aes(color = cluster))
Here, instead of the 7 clusters suggested by the AKC breed FirstClassification, 2 clusters were chosen because of their significantly higher silhouette width of 0.45. The following table shows the medoids (i.e. the dog breeds most representative of their groups) with their numeric variables and corresponding values. The two groups differ distinctly in size, with the first group taller and heavier than the second. The next plot illustrates the average silhouette width of 0.45, a weak goodness of fit. The last plot (using ggpairs) shows the differences between the two groupings across average height, weight, and puppy price. Across all variables, the first group tends to have higher values than the second. While there is a weak correlation between average puppy price and height/weight, there is a strong positive correlation between height and weight (as expected).
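The per-cluster widths behind the silhouette plot are also stored on the pam object, which makes it easy to see whether one of the two groups drives the weak 0.45 average:

```r
pam3$silinfo$clus.avg.widths  # average width within each cluster
pam3$silinfo$avg.width        # overall average (0.45 above)
```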
# standardize data
numeric_dog_data <- dog_data_wider %>% select(Breed, AvgHeightInches,
AvgWeightLbs, AvgPupPrice) %>% distinct()
temp <- dog_data_wider %>% select(Breed, FirstClassification,
AvgHeightInches, AvgWeightLbs, AvgPupPrice) %>% distinct()
numeric_dog_data <- numeric_dog_data %>% select(-Breed)
numeric_dog_data <- data.frame(scale(numeric_dog_data))
rownames(numeric_dog_data) <- temp$Breed
# PCA run-through
pca1 <- prcomp(numeric_dog_data, center = T, scale = T)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.4054 0.9417 0.3715
## Proportion of Variance 0.6584 0.2956 0.0460
## Cumulative Proportion 0.6584 0.9540 1.0000
# Number of PCAs
dog_pca <- princomp(numeric_dog_data)
eigval <- dog_pca$sdev^2
varprop = round(eigval/sum(eigval), 2)
ggplot() + geom_bar(aes(y = varprop, x = 1:3), stat = "identity") +
xlab("") + geom_path(aes(y = varprop, x = 1:3)) + geom_text(aes(x = 1:3,
y = varprop, label = round(varprop, 2)), vjust = 1, col = "white",
size = 5) + scale_y_continuous(breaks = seq(0, 0.6, 0.2),
labels = scales::percent) + scale_x_continuous(breaks = 1:10)
round(cumsum(eigval)/sum(eigval), 2)
## Comp.1 Comp.2 Comp.3
## 0.66 0.95 1.00
# Plot PCA scores
dogdf <- data.frame(Name = temp$Breed, PC1 = dog_pca$scores[,
1], PC2 = dog_pca$scores[, 2])
ggplot(dogdf, aes(PC1, PC2)) + geom_point()
# Colored on first classification of breeds
temp <- temp %>% mutate(PC1 = dog_pca$scores[, 1]) %>% mutate(PC2 = dog_pca$scores[,
2])
temp %>% ggplot(aes(x = PC1, y = PC2, color = FirstClassification)) +
geom_point()
# highest on PC1
dog_pca$scores[, 1:3] %>% as.data.frame %>% top_n(3, Comp.1)
## Comp.1 Comp.2 Comp.3
## Tibetan Mastiff 3.710203 -2.6626901 -1.10866676
## Irish Wolfhound 3.030720 -0.5588457 0.08596663
## Mastiff 3.535800 0.5781411 -1.55409026
# lowest PC1
dog_pca$scores[, 1:3] %>% as.data.frame %>% top_n(-3, Comp.1)
## Comp.1 Comp.2 Comp.3
## Chihuahua -2.323342 -0.05791309 -0.3990251
## Papillon -2.171822 0.41047196 -0.2976064
## Japanese Chin -2.234822 0.58937347 -0.3145928
# highest on PC2
dog_pca$scores[, 1:3] %>% as.data.frame %>% top_n(3, wt = Comp.2)
## Comp.1 Comp.2 Comp.3
## English Setter 0.4231534 1.379214 0.28094736
## Plott Hound -0.1764302 1.361982 0.18836318
## Harrier -0.4114127 1.359473 -0.03397673
# lowest on PC2
dog_pca$scores[, 1:3] %>% as.data.frame %>% top_n(3, wt = desc(Comp.2))
## Comp.1 Comp.2 Comp.3
## French Bulldog -0.1157225 -3.900273 0.046694394
## Portuguese Water Dog 1.2086945 -2.681636 0.497482494
## Sussex Spaniel 0.3464221 -2.728736 -0.002820067
Focusing on average height, weight, and puppy price, three PCs describe the entirety of the dataset. That said, only PC1 and PC2 are needed to describe 95% of the variance, with PC1 accounting for 66% and PC2 for 30%. This is shown in the bar plot of the proportion of variance each PC explains. The first scatterplot plots each dog breed across PC1 and PC2, with no apparent trend. The second scatterplot colors the points by the FirstClassification of the breeds and shows some distinction between the groupings along PC1: toy breeds tend to score low on PC1, working dogs score high, and the other 5 groups overlap in between with slight clustering. This matches a plot of average height vs. weight, where the same observation holds: toy breeds are the smallest while working breeds tend to be the largest. Across PC2, the values were generally within [-1, 1], with some outliers scoring more negatively. The three breeds scoring highest on PC1 were the Tibetan Mastiff, Irish Wolfhound, and Mastiff, meaning these breeds scored highest on both height and weight. Conversely, the three breeds scoring lowest on PC1 were the Chihuahua, Papillon, and Japanese Chin, which scored lowest on both. The three breeds scoring highest on PC2 were the English Setter, Plott Hound, and Harrier; these breeds tend to be heavier than expected for their height. The three breeds scoring lowest on PC2 were the French Bulldog, Portuguese Water Dog, and Sussex Spaniel; these breeds tend to be lighter than expected for their height. While average puppy price is included in this analysis, there is no clear evidence of how it affects the PCA results, as previous visualizations showed only a weak correlation between puppy price and group/cluster distinctions.
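The size interpretation of PC1 and the height-for-weight contrast on PC2 can be checked against the loadings of the fitted PCA (here from the `princomp` object used for the scores above):

```r
# Columns are the principal components; entries show how strongly each
# original variable contributes to each component.
round(unclass(dog_pca$loadings), 2)
```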
# standardize data
numeric_dog_data <- dog_data_wider %>% select(Breed, AvgHeightInches,
AvgWeightLbs, AvgPupPrice, Watchdog) %>% na.omit()
temp <- dog_data_wider %>% select(Breed, FirstClassification,
AvgHeightInches, AvgWeightLbs, AvgPupPrice, Watchdog) %>%
na.omit()
numeric_dog_data <- numeric_dog_data %>% select(-Breed)
numeric_dog_data <- data.frame(scale(numeric_dog_data))
rownames(numeric_dog_data) <- temp$Breed
# PCA run-through
pca1 <- prcomp(numeric_dog_data, center = T, scale = T)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.4828 0.9433 0.8812 0.36764
## Proportion of Variance 0.5497 0.2224 0.1941 0.03379
## Cumulative Proportion 0.5497 0.7721 0.9662 1.00000
# Number of PCAs
dog_pca <- princomp(numeric_dog_data)
eigval <- dog_pca$sdev^2
varprop = round(eigval/sum(eigval), 2)
ggplot() + geom_bar(aes(y = varprop, x = 1:4), stat = "identity") +
xlab("") + geom_path(aes(y = varprop, x = 1:4)) + geom_text(aes(x = 1:4,
y = varprop, label = round(varprop, 2)), vjust = 1, col = "white",
size = 5) + scale_y_continuous(breaks = seq(0, 0.6, 0.2),
labels = scales::percent) + scale_x_continuous(breaks = 1:10)
round(cumsum(eigval)/sum(eigval), 2)
## Comp.1 Comp.2 Comp.3 Comp.4
## 0.55 0.77 0.97 1.00
# Plot PCA scores
dogdf <- data.frame(Name = temp$Breed, PC1 = dog_pca$scores[,
1], PC2 = dog_pca$scores[, 2])
ggplot(dogdf, aes(PC1, PC2)) + geom_point()
# Colored on first classification of breeds
temp <- temp %>% mutate(PC1 = dog_pca$scores[, 1]) %>% mutate(PC2 = dog_pca$scores[,
2])
temp %>% ggplot(aes(x = PC1, y = PC2, color = FirstClassification)) +
geom_point()
# highest on PC1
dog_pca$scores[, 1:4] %>% as.data.frame %>% top_n(3, Comp.1)
## Comp.1 Comp.2 Comp.3 Comp.4
## Tibetan Mastiff 4.049421 -2.6371733 -0.5326915 -1.0646570
## Mastiff 3.912426 0.5773868 -0.1083164 -1.5113643
## Great Dane 3.034720 1.0938982 0.4296456 -0.3981539
# lowest PC1
dog_pca$scores[, 1:4] %>% as.data.frame %>% top_n(-3, Comp.1)
## Comp.1 Comp.2 Comp.3 Comp.4
## Affenpinscher -2.288034 -0.56402439 0.89246433 -0.2371198
## Chihuahua -2.574198 -0.05569113 0.07232906 -0.4273983
## Toy Fox Terrier -2.421223 0.40697826 0.22676088 -0.2374032
# highest on PC2
dog_pca$scores[, 1:4] %>% as.data.frame %>% top_n(3, wt = Comp.2)
## Comp.1 Comp.2 Comp.3 Comp.4
## Great Pyrenees 2.4046602 1.349217 -0.54127308 -0.002333538
## Pointer 0.1077203 1.337919 -0.01285407 0.166004375
## Plott Hound -0.3265241 1.292278 0.55205369 0.161829939
# lowest on PC2
dog_pca$scores[, 1:4] %>% as.data.frame %>% top_n(3, wt = desc(Comp.2))
## Comp.1 Comp.2 Comp.3 Comp.4
## French Bulldog -0.3007871 -3.904418 -0.27913389 0.02799855
## Portuguese Water Dog 1.1918826 -2.697060 -0.16148442 0.49573228
## Sussex Spaniel 0.1308120 -2.779641 0.08636783 -0.03146677
By adding one more variable, watchdog rating, to the PCA, the number of PCs needed to describe the entire dataset increases to 4. Still, PC1 and PC2 alone capture 77% of the variance in the data, which is sufficient: PC1 accounts for 55% and PC2 for 22%. The bar plot illustrates the proportion of variance explained by each PC. Despite the added variable, the two scatterplots are essentially the same as those in the previous section. The three breeds that scored highest on PC1 were the Tibetan Mastiff, Mastiff, and Great Dane, meaning these breeds were the largest in both height and weight. Conversely, the three breeds that scored lowest on PC1 (the Affenpinscher, Chihuahua, and Toy Fox Terrier) were the smallest in both height and weight. The three breeds scoring highest on PC2 (the Great Pyrenees, Pointer, and Plott Hound) tend to be heavier than expected for their height, while the three scoring lowest (the French Bulldog, Portuguese Water Dog, and Sussex Spaniel, consistent with the previous section) tend to be lighter than expected for their height. Although average puppy price and watchdog rating were included in this analysis, there is no clear evidence of how they shape the components; the previous visualizations showed only a weak correlation between puppy price/watchdog rating and group/cluster distinctions.
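The size-versus-build reading of PC1 and PC2 comes from the component loadings (the weight each variable receives in each PC). A sketch of the idea on synthetic correlated data, not the actual breed loadings:

```r
set.seed(322)
# two strongly correlated measurements, as height and weight are for breeds
height <- rnorm(200)
weight <- 0.9 * height + rnorm(200, sd = 0.45)
pca <- prcomp(cbind(height, weight), scale. = TRUE)
round(pca$rotation, 2)
# PC1: both loadings share a sign   -> overall "size"
# PC2: loadings have opposite signs -> weight relative to height ("build")
```

On the real data, `dog_pca$loadings` would show the analogous pattern across the height/weight variables.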
# create numeric data without NA-heavy variables
numeric_dog_data <- dog_data_wider %>% select(-IntelligenceRating) %>%
    select(where(is.numeric)) %>% na.omit()
loyal_dog_data <- numeric_dog_data %>% select(1:9, Loyal)
logistic_fit <- glm(Loyal ~ ., data = loyal_dog_data, family = "binomial")
prob_reg <- predict(logistic_fit, type = "response")
class_diag(prob_reg, loyal_dog_data$Loyal, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7596 0.4571 0.913 0.7273 0.5614 0.6851 0.7681
# Confusion Matrix
table(actual = factor(loyal_dog_data$Loyal == 1, levels = c("TRUE",
"FALSE")), predicted = factor(prob_reg > 0.5, levels = c("TRUE",
"FALSE"))) %>% addmargins
## predicted
## actual TRUE FALSE Sum
## TRUE 16 19 35
## FALSE 6 63 69
## Sum 22 82 104
Using a linear classifier, the model was trained on all the numeric variables (excluding IntelligenceRating because of too many NA values) for the entire dataset. The AUC was 0.7681, indicating the model was decently successful at predicting the binary variable Loyal. There were 16 true positives, 19 false negatives (type II error), 6 false positives (type I error), and 63 true negatives. The true positive rate (tpr/“sensitivity”) is 0.457, indicating that only a subpar proportion of actual positives were correctly predicted. The true negative rate (tnr/“specificity”) is 0.913, indicating that a high proportion of actual negatives were correctly predicted. The positive predictive value (“precision”) is 0.727, indicating that a high proportion of cases predicted positive were actually positive.
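class_diag() is a course-provided helper; its headline numbers can be reproduced directly from the confusion-matrix counts above, and the AUC can be computed from scratch via the rank (Wilcoxon) identity. A base-R sketch:

```r
# counts from the confusion matrix above
tp <- 16; fn <- 19; fp <- 6; tn <- 63
acc  <- (tp + tn)/(tp + fn + fp + tn)  # accuracy
sens <- tp/(tp + fn)                   # tpr / sensitivity
spec <- tn/(tn + fp)                   # tnr / specificity
ppv  <- tp/(tp + fp)                   # precision
round(c(acc = acc, sens = sens, spec = spec, ppv = ppv), 4)

# AUC via the rank (Wilcoxon) identity: the probability that a random
# positive case receives a higher predicted probability than a random negative
auc <- function(probs, truth) {
    r  <- rank(probs)
    n1 <- sum(truth == 1)
    n0 <- sum(truth == 0)
    (sum(r[truth == 1]) - n1 * (n1 + 1)/2)/(n1 * n0)
}
```

Applied to `prob_reg` and `loyal_dog_data$Loyal`, this `auc()` would reproduce the 0.7681 reported by class_diag().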
affectionate_dog_data <- numeric_dog_data %>% select(1:9, Affectionate)
logistic_fit <- glm(Affectionate ~ ., data = affectionate_dog_data,
family = "binomial")
prob_reg <- predict(logistic_fit, type = "response")
class_diag(prob_reg, affectionate_dog_data$Affectionate, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.6923 0.1471 0.9571 0.625 0.2381 0.5521 0.6416
# Confusion Matrix
table(actual = factor(affectionate_dog_data$Affectionate == 1,
levels = c("TRUE", "FALSE")), predicted = factor(prob_reg >
0.5, levels = c("TRUE", "FALSE"))) %>% addmargins
## predicted
## actual TRUE FALSE Sum
## TRUE 5 29 34
## FALSE 3 67 70
## Sum 8 96 104
Using a linear classifier, the model was trained on all the numeric variables (excluding IntelligenceRating because of too many NA values) for the entire dataset. The AUC was 0.6416, indicating the model was only modestly successful at predicting the binary variable Affectionate. There were 5 true positives, 29 false negatives (type II error), 3 false positives (type I error), and 67 true negatives. The true positive rate (tpr/“sensitivity”) is 0.147, indicating that a low proportion of actual positives were correctly predicted. The true negative rate (tnr/“specificity”) is 0.957, indicating that a high proportion of actual negatives were correctly predicted. The positive predictive value (“precision”) is 0.625, indicating that a fair proportion of cases predicted positive were actually positive.
friendly_dog_data <- numeric_dog_data %>% select(1:9, Friendly)
logistic_fit <- glm(Friendly ~ ., data = friendly_dog_data, family = "binomial")
prob_reg <- predict(logistic_fit, type = "response")
class_diag(prob_reg, friendly_dog_data$Friendly, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7212 0.25 0.9306 0.6154 0.3556 0.5903 0.7122
# Confusion Matrix
table(actual = factor(friendly_dog_data$Friendly == 1, levels = c("TRUE",
"FALSE")), predicted = factor(prob_reg > 0.5, levels = c("TRUE",
"FALSE"))) %>% addmargins
## predicted
## actual TRUE FALSE Sum
## TRUE 8 24 32
## FALSE 5 67 72
## Sum 13 91 104
Using a linear classifier, the model was trained on all the numeric variables (excluding IntelligenceRating because of too many NA values) for the entire dataset. The AUC was 0.7122, indicating the model was fairly successful at predicting the binary variable Friendly. There were 8 true positives, 24 false negatives (type II error), 5 false positives (type I error), and 67 true negatives. The true positive rate (tpr/“sensitivity”) is 0.25, indicating that a low proportion of actual positives were correctly predicted. The true negative rate (tnr/“specificity”) is 0.931, indicating that a high proportion of actual negatives were correctly predicted. The positive predictive value (“precision”) is 0.615, indicating that a fair proportion of cases predicted positive were actually positive.
set.seed(322)
k = 10
data <- sample_frac(loyal_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Loyal
# train model
fit <- glm(Loyal ~ ., data = train, family = "binomial")
# test model
probs <- predict(fit, newdata = test, type = "response")
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.71183 0.35667 0.8475 NaN NaN 0.60208 0.68143
Using the linear classifier, the AUC was 0.7681; after cross-validation, however, the AUC dropped to 0.68143. This drop indicates two things: the model trained on the entire dataset does not generalize as well as its in-sample metrics suggested, and there are signs of overfitting (i.e. the first model was fit too closely to the quirks of the entire dataset). That said, both the full-data model and the k-fold cross-validated model show reasonable success at predicting the binary variable Loyal.
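A note on the fold construction used above: `rep(1:k, length.out = n)` assigns fold labels in order, so the earlier `sample_frac()` shuffle is what makes the folds random. A quick base-R check of the fold sizes for 104 rows and 10 folds:

```r
n <- 104; k <- 10
folds <- rep(1:k, length.out = n)  # labels 1,2,...,10,1,2,... over the rows
table(folds)  # folds 1-4 hold 11 rows each, folds 5-10 hold 10 each
```

Because the labels are assigned deterministically, skipping the shuffle would make the folds track the original row order (here, alphabetical by breed), which could bias the cross-validation estimates.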
set.seed(322)
k = 10
data <- sample_frac(affectionate_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Affectionate
# train model
fit <- glm(Affectionate ~ ., data = train, family = "binomial")
# test model
probs <- predict(fit, newdata = test, type = "response")
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.61637 0.08333 0.90417 NaN NaN 0.49375 0.40428
Using the linear classifier, the AUC was 0.6416; after cross-validation, however, the AUC dropped to 0.40428. This drop indicates two things: the model trained on the entire dataset does not generalize as well as its in-sample metrics suggested, and there are signs of overfitting (i.e. the first model was fit too closely to the quirks of the entire dataset). While the linear classifier appeared to perform reasonably when trained on the entire dataset, the cross-validated AUC falls below 0.5, meaning the model predicts the binary variable Affectionate no better than chance.
set.seed(322)
k = 10
data <- sample_frac(friendly_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Friendly
# train model
fit <- glm(Friendly ~ ., data = train, family = "binomial")
# test model
probs <- predict(fit, newdata = test, type = "response")
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.67092 0.28333 0.86095 0.47333 NaN 0.57213 0.64821
Using the linear classifier, the AUC was 0.7122; after cross-validation, however, the AUC dropped to 0.64821. This drop indicates two things: the model trained on the entire dataset does not generalize as well as its in-sample metrics suggested, and there are signs of overfitting (i.e. the first model was fit too closely to the quirks of the entire dataset). That said, both the full-data model and the k-fold cross-validated model show reasonable success at predicting the binary variable Friendly.
library(caret)
knn_fit <- knn3(Loyal == 1 ~ ., data = loyal_dog_data)
prob_knn <- predict(knn_fit, loyal_dog_data)
class_diag(prob_knn[, 2], loyal_dog_data$Loyal, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.8077 0.6571 0.8841 0.7419 0.697 0.7706 0.846
# Confusion Matrix (using the kNN probabilities, prob_knn)
table(actual = factor(loyal_dog_data$Loyal == 1, levels = c("TRUE",
    "FALSE")), predicted = factor(prob_knn[, 2] > 0.5, levels = c("TRUE",
    "FALSE"))) %>% addmargins
## predicted
## actual TRUE FALSE Sum
## TRUE 23 12 35
## FALSE 8 61 69
## Sum 31 73 104
Using a non-parametric classifier (k-nearest-neighbors), the model was trained on all the numeric variables (excluding IntelligenceRating because of too many NA values) for the entire dataset. The AUC was 0.846, indicating the model was successful at predicting the binary variable Loyal. There were 23 true positives, 12 false negatives (type II error), 8 false positives (type I error), and 61 true negatives. The true positive rate (tpr/“sensitivity”) is 0.657, indicating that a moderate proportion of actual positives were correctly predicted. The true negative rate (tnr/“specificity”) is 0.884, indicating that a high proportion of actual negatives were correctly predicted. The positive predictive value (“precision”) is 0.742, indicating that a high proportion of cases predicted positive were actually positive.
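knn3() comes from the caret package; the underlying idea, scoring each case by the fraction of its k nearest (Euclidean) neighbors that are positive, can be sketched in base R on toy data (an illustration of the technique, not the fitted breed model):

```r
knn_prob <- function(train_x, train_y, new_x, k = 5) {
    # Euclidean distance from the new point to every training point
    d  <- sqrt(colSums((t(train_x) - new_x)^2))
    nn <- order(d)[1:k]  # indices of the k nearest neighbors
    mean(train_y[nn])    # fraction of those neighbors that are positive
}
train_x <- rbind(c(0, 0), c(0, 1), c(5, 5))
train_y <- c(0, 0, 1)
knn_prob(train_x, train_y, c(0, 0.5), k = 2)  # two nearest are negatives -> 0
```

Because the score is a neighbor vote rather than a fitted linear function, kNN can capture non-linear boundaries, which is one reason it outperformed the logistic models in-sample here.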
knn_fit <- knn3(Affectionate == 1 ~ ., data = affectionate_dog_data)
prob_knn <- predict(knn_fit, affectionate_dog_data)
class_diag(prob_knn[, 2], affectionate_dog_data$Affectionate,
positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.75 0.5 0.8714 0.6538 0.5667 0.6857 0.7725
# Confusion Matrix (using the kNN probabilities, prob_knn)
table(actual = factor(affectionate_dog_data$Affectionate == 1,
    levels = c("TRUE", "FALSE")), predicted = factor(prob_knn[,
    2] > 0.5, levels = c("TRUE", "FALSE"))) %>% addmargins
## predicted
## actual TRUE FALSE Sum
## TRUE 17 17 34
## FALSE 9 61 70
## Sum 26 78 104
Using a non-parametric classifier (k-nearest-neighbors), the model was trained on all the numeric variables (excluding IntelligenceRating because of too many NA values) for the entire dataset. The AUC was 0.7725, indicating the model was fairly successful at predicting the binary variable Affectionate. There were 17 true positives, 17 false negatives (type II error), 9 false positives (type I error), and 61 true negatives. The true positive rate (tpr/“sensitivity”) is 0.5, indicating that a moderate proportion of actual positives were correctly predicted. The true negative rate (tnr/“specificity”) is 0.871, indicating that a high proportion of actual negatives were correctly predicted. The positive predictive value (“precision”) is 0.654, indicating that a fair proportion of cases predicted positive were actually positive.
knn_fit <- knn3(Friendly == 1 ~ ., data = friendly_dog_data)
prob_knn <- predict(knn_fit, friendly_dog_data)
class_diag(prob_knn[, 2], friendly_dog_data$Friendly, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.75 0.4375 0.8889 0.6364 0.5185 0.6632 0.8082
# Confusion Matrix (using the kNN probabilities, prob_knn)
table(actual = factor(friendly_dog_data$Friendly == 1, levels = c("TRUE",
    "FALSE")), predicted = factor(prob_knn[, 2] > 0.5, levels = c("TRUE",
    "FALSE"))) %>% addmargins
## predicted
## actual TRUE FALSE Sum
## TRUE 14 18 32
## FALSE 8 64 72
## Sum 22 82 104
Using a non-parametric classifier (k-nearest-neighbors), the model was trained on all the numeric variables (excluding IntelligenceRating because of too many NA values) for the entire dataset. The AUC was 0.8082, indicating the model was successful at predicting the binary variable Friendly. There were 14 true positives, 18 false negatives (type II error), 8 false positives (type I error), and 64 true negatives. The true positive rate (tpr/“sensitivity”) is 0.438, indicating that a moderate proportion of actual positives were correctly predicted. The true negative rate (tnr/“specificity”) is 0.889, indicating that a high proportion of actual negatives were correctly predicted. The positive predictive value (“precision”) is 0.636, indicating that a fair proportion of cases predicted positive were actually positive.
set.seed(322)
k = 10
data <- sample_frac(loyal_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Loyal
# train model
fit <- knn3(Loyal == 1 ~ ., data = train)
# test model
probs <- predict(fit, newdata = test)[, 2]
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.68183 0.41666 0.84964 NaN NaN 0.63316 0.73137
Using the non-parametric classifier, the AUC was 0.846; after cross-validation, however, the AUC dropped to 0.73137. This drop indicates two things: the model trained on the entire dataset does not generalize as well as its in-sample metrics suggested, and there are signs of overfitting (i.e. the first model was fit too closely to the quirks of the entire dataset). That said, both the full-data model and the k-fold cross-validated model show fair success at predicting the binary variable Loyal. Compared to the linear classifier, the non-parametric classifier performed better in both settings: trained on the entire dataset and under k-fold cross-validation.
set.seed(322)
k = 10
data <- sample_frac(affectionate_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Affectionate
# train model
fit <- knn3(Affectionate == 1 ~ ., data = train)
# test model
probs <- predict(fit, newdata = test)[, 2]
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.5791 0.22357 0.8002 NaN NaN 0.51189 0.53956
Using the non-parametric classifier, the AUC was 0.7725; after cross-validation, however, the AUC dropped to 0.53956. This drop indicates two things: the model trained on the entire dataset does not generalize as well as its in-sample metrics suggested, and there are signs of overfitting (i.e. the first model was fit too closely to the quirks of the entire dataset). Like the linear classifier for Affectionate, the non-parametric classifier overfits substantially: while it appeared to perform reasonably when trained on the entire dataset, cross-validation shows its predictions for the binary variable Affectionate are barely better than chance. That said, compared to the linear classifier, the non-parametric classifier still performed better in both settings, trained on the entire dataset and under k-fold cross-validation (similar to the non-parametric classifier for Loyal).
set.seed(322)
k = 10
data <- sample_frac(friendly_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Friendly
# train model
fit <- knn3(Friendly == 1 ~ ., data = train)
# test model
probs <- predict(fit, newdata = test)[, 2]
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.59273 0.13333 0.8246 NaN NaN 0.47896 0.57603
Using the non-parametric classifier, the AUC was 0.8082; after cross-validation, however, the AUC dropped to 0.57603. This drop indicates two things: the model trained on the entire dataset does not generalize as well as its in-sample metrics suggested, and there are signs of overfitting (i.e. the first model was fit too closely to the quirks of the entire dataset). As in the previous section, this cross-validation shows substantial overfitting and a drastic decrease in AUC: while the non-parametric predictions appeared strong in-sample, the k-fold cross-validation indicates the model performs weakly at best. Compared to the linear classifier, the non-parametric classifier performed better when trained on the entire dataset; under k-fold cross-validation, however, the linear classifier performs better. For predicting Friendly, then, the linear classifier shows less overfitting than the non-parametric classifier.
# predicted 'loyal' from all other variables
linear_fit <- lm(Loyal ~ ., data = loyal_dog_data)
# predicted 'loyal'
yhat <- predict(linear_fit)
# mean squared error (MSE)
mean((loyal_dog_data$Loyal - yhat)^2)
## [1] 0.1742844
The measure of prediction error for the linear regression model for the binary variable Loyal is 0.1742844.
# predicted 'affectionate' from all other variables
linear_fit <- lm(Affectionate ~ ., data = affectionate_dog_data)
# predicted 'affectionate'
yhat <- predict(linear_fit)
# mean squared error (MSE)
mean((affectionate_dog_data$Affectionate - yhat)^2)
## [1] 0.2072373
The measure of prediction error for the linear regression model for the binary variable Affectionate is 0.2072373.
# predicted 'friendly' from all other variables
linear_fit <- lm(Friendly ~ ., data = friendly_dog_data)
# predicted 'friendly'
yhat <- predict(linear_fit)
# mean squared error (MSE)
mean((friendly_dog_data$Friendly - yhat)^2)
## [1] 0.1875234
The measure of prediction error for the linear regression model for the binary variable Friendly is 0.1875234.
set.seed(322)
k = 10
data <- sample_frac(loyal_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Loyal
# train model
fit <- lm(Loyal ~ ., data = train)
# test model
yhat <- predict(fit, newdata = test)
# store the MSE for each fold (append, don't overwrite)
diags <- c(diags, mean((test$Loyal - yhat)^2))
}
# average MSE across all folds
mean(diags)
## [1] 0.2403649
The initial in-sample estimate of prediction error was 0.1742844; cross-validation using k-folds, however, shows that the out-of-sample prediction error is closer to 0.2403649. The increase indicates the in-sample estimate was optimistic: the model overfits the quirks of the full dataset, and those quirks do not carry over to held-out observations, so predictions on new data are more erroneous than the training error suggests.
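The in-sample/out-of-sample gap described here can be reproduced on synthetic data by fitting an overly flexible model; a base-R sketch (all variables hypothetical):

```r
set.seed(322)
n <- 60
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train <- 1:40; test <- 41:60
# an overly flexible model chases noise in the training rows
fit <- lm(y ~ poly(x, 15), subset = train)
mse_train <- mean((y[train] - predict(fit, data.frame(x = x[train])))^2)
mse_test  <- mean((y[test] - predict(fit, data.frame(x = x[test])))^2)
mse_train < mse_test  # training error typically understates test error
```

The same mechanism drives the breed models above: error measured on the data the model was fit to is a downward-biased estimate of error on new data.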
set.seed(322)
k = 10
data <- sample_frac(affectionate_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Affectionate
# train model
fit <- lm(Affectionate ~ ., data = train)
# test model
yhat <- predict(fit, newdata = test)
# store the MSE for each fold (append, don't overwrite)
diags <- c(diags, mean((test$Affectionate - yhat)^2))
}
# average MSE across all folds
mean(diags)
## [1] 0.1959573
The initial in-sample estimate of prediction error was 0.2072373. Unlike the linear regression model for Loyal, cross-validation gives a prediction error of only 0.1959573, slightly less than the in-sample estimate. In this case there is no sign of overfitting: the model trained on the whole dataset and the k-fold cross-validated model perform very similarly.
set.seed(322)
k = 10
data <- sample_frac(friendly_dog_data)
folds <- rep(1:k, length.out = nrow(data))
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Friendly
# train model
fit <- lm(Friendly ~ ., data = train)
# test model
yhat <- predict(fit, newdata = test)
# store the MSE for each fold (append, don't overwrite)
diags <- c(diags, mean((test$Friendly - yhat)^2))
}
# average MSE across all folds
mean(diags)
## [1] 0.2013954
The initial in-sample estimate of prediction error was 0.1875234; k-fold cross-validation, however, indicates the prediction error is actually higher, at 0.2013954. The increase during cross-validation shows the in-sample estimate slightly understated the true error, indicating mild overfitting of the model trained on the entire dataset. Overall, though, the linear regression models for all three binary variables performed fairly well, with low prediction errors.
library(reticulate)
use_python("/usr/bin/python3", required = F)
import pandas as pd
pd.set_option('display.max_columns', None)
dog_data = r.dog_data_wider
dog_data = dog_data.sort_values(by= ["FirstClassification", "Breed"])
dog_data.head()
## Breed AvgHeightInches LowHeightInches HighHeightInches \
## 29 Bearded Collie 21.0 20.0 22.0
## 25 Beauceron 25.5 24.0 27.0
## 81 Border Collie 20.0 19.0 21.0
## 37 Briard 25.0 23.0 27.0
## 46 Canaan Dog 21.5 19.0 24.0
##
## AvgWeightLbs LowWeightLbs HighWeightLbs AvgPupPrice \
## 29 50.0 40.0 60.0 1300.0
## 25 110.0 100.0 120.0 1350.0
## 81 40.0 40.0 40.0 700.0
## 37 75.0 74.0 76.0 1100.0
## 46 45.0 35.0 55.0 1000.0
##
## IntelligenceRating Watchdog FirstClassification SecondClassification \
## 29 34.0 4.0 Herding Herding
## 25 NaN 5.0 Herding Herding
## 81 1.0 6.0 Herding Herding
## 37 30.0 6.0 Herding Gun
## 46 NaN 3.0 Herding Sight
##
## Alert Friendly Affectionate Loyal Gentle Independent Protective \
## 29 1.0 0.0 0.0 0.0 0.0 0.0 0.0
## 25 0.0 0.0 0.0 1.0 0.0 0.0 1.0
## 81 1.0 0.0 0.0 1.0 0.0 0.0 1.0
## 37 0.0 0.0 0.0 1.0 0.0 0.0 1.0
## 46 1.0 0.0 0.0 0.0 0.0 0.0 0.0
##
## Playful
## 29 0.0
## 25 0.0
## 81 0.0
## 37 0.0
## 46 0.0
type(dog_data)
## <class 'pandas.core.frame.DataFrame'>
head(py$dog_data)
## Breed AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## 29 Bearded Collie 21.0 20 22 50
## 25 Beauceron 25.5 24 27 110
## 81 Border Collie 20.0 19 21 40
## 37 Briard 25.0 23 27 75
## 46 Canaan Dog 21.5 19 24 45
## LowWeightLbs HighWeightLbs AvgPupPrice IntelligenceRating Watchdog
## 29 40 60 1300 34 4
## 25 100 120 1350 NA 5
## 81 40 40 700 1 6
## 37 74 76 1100 30 6
## 46 35 55 1000 NA 3
## FirstClassification SecondClassification Alert Friendly Affectionate Loyal
## 29 Herding Herding 1 0 0 0
## 25 Herding Herding 0 0 0 1
## 81 Herding Herding 1 0 0 1
## 37 Herding Gun 0 0 0 1
## 46 Herding Sight 1 0 0 0
## Gentle Independent Protective Playful
## 29 0 0 0 0
## 25 0 0 1 0
## 81 0 0 1 0
## 37 0 0 1 0
## 46 0 0 0 0
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
Here, the R data frame dog_data_wider is made available to Python as a pandas DataFrame through reticulate. Simple data wrangling is demonstrated on the Python DataFrame dog_data, arranging the observations by FirstClassification and then by Breed name. The Python DataFrame is then accessed back in R via py$dog_data and passed through the head() function.
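The sort_values() call above orders by the classification column and breaks ties alphabetically by breed; a minimal pandas sketch of the same pattern on a toy frame (hypothetical rows, not the real dataset):

```python
import pandas as pd

# toy stand-in for dog_data with the two sort keys used above
toy = pd.DataFrame({
    "Breed": ["Briard", "Beagle", "Border Collie"],
    "FirstClassification": ["Herding", "Hound", "Herding"],
})
# sort by classification first, then alphabetically by breed within each class
toy = toy.sort_values(by=["FirstClassification", "Breed"]).reset_index(drop=True)
print(toy["Breed"].tolist())  # ['Border Collie', 'Briard', 'Beagle']
```

reset_index(drop=True) renumbers the rows after sorting; without it, the shuffled original indices (like the 29, 25, 81 seen above) are kept.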
Here are two more of my favorite pictures of my dog HoiSum!